In part 1 of this assignment you will use nltk to explore the Herman Melville novel Moby Dick. Then in part 2 you will create a spelling recommender function that uses nltk to find words similar to the misspelling.
import nltk
import pandas as pd
import numpy as np
# If you would like to work with the raw text you can use 'moby_raw'
with open('moby.txt', 'r') as f:
    moby_raw = f.read()
# If you would like to work with the novel in nltk.Text format you can use 'text1'
moby_tokens = nltk.word_tokenize(moby_raw)
text1 = nltk.Text(moby_tokens)
How many tokens (words and punctuation symbols) are in text1?
This function should return an integer.
def example_one():
    return len(nltk.word_tokenize(moby_raw))  # or alternatively len(text1)
example_one()
255028
How many unique tokens (unique words and punctuation) does text1 have?
This function should return an integer.
def example_two():
    return len(set(nltk.word_tokenize(moby_raw)))  # or alternatively len(set(text1))
example_two()
20742
After lemmatizing the verbs, how many unique tokens does text1 have?
This function should return an integer.
from nltk.stem import WordNetLemmatizer
def example_three():
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w, 'v') for w in text1]
    return len(set(lemmatized))
example_three()
16887
What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)
This function should return a float.
def answer_one():
    tokens = nltk.word_tokenize(moby_raw)
    diversity = len(set(tokens)) / float(len(tokens))
    return diversity
answer_one()
0.08133224587104161
What percentage of tokens is 'whale' or 'Whale'?
This function should return a float.
def answer_two():
    from nltk.probability import FreqDist
    token_dict = FreqDist(moby_tokens)
    return (token_dict['whale'] + token_dict['Whale']) * 100 / float(len(moby_tokens))
answer_two()
0.41250372508116756
What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?
This function should return a list of 20 tuples where each tuple is of the form (token, frequency). The list should be sorted in descending order of frequency.
def answer_three():
    from nltk.probability import FreqDist
    import operator
    token_dict = FreqDist(moby_tokens)
    # sort ascending by frequency, then take the last 20 and reverse for descending order
    sorted_token_dict = sorted(token_dict.items(), key=operator.itemgetter(1))
    lst = sorted_token_dict[-20:]
    lst.reverse()
    return lst
answer_three()
[(',', 19204), ('the', 13715), ('.', 7306), ('of', 6513), ('and', 6010), ('a', 4545), ('to', 4515), (';', 4173), ('in', 3908), ('that', 2978), ('his', 2459), ('it', 2196), ('I', 2113), ('!', 1767), ('is', 1722), ('--', 1713), ('with', 1659), ('he', 1658), ('was', 1639), ('as', 1620)]
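As a design note, FreqDist can do this sorting itself; a minimal sketch, assuming moby_tokens is defined as above, that should produce the same 20 pairs:
from nltk.probability import FreqDist
# most_common(20) returns (token, frequency) pairs already sorted by descending frequency
FreqDist(moby_tokens).most_common(20)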
What tokens have a length greater than 5 and a frequency of more than 150?
This function should return an alphabetically sorted list of the tokens that match the above constraints. To sort your list, use sorted()
def answer_four():
    from nltk.tokenize import word_tokenize
    from nltk.probability import FreqDist
    dist = FreqDist(word_tokenize(moby_raw))
    freqwords = [w for w in dist.keys() if len(w) > 5 and dist[w] > 150]
    freqwords.sort()
    return freqwords
answer_four()
['Captain', 'Pequod', 'Queequeg', 'Starbuck', 'almost', 'before', 'himself', 'little', 'seemed', 'should', 'though', 'through', 'whales', 'without']
Find the longest word in text1 and that word's length.
This function should return a tuple (longest_word, length).
def answer_five():
    from nltk.tokenize import word_tokenize
    longest_word = None
    max_len = 0
    for word in word_tokenize(moby_raw):
        if len(word) > max_len:
            longest_word = word
            max_len = len(word)
    return (longest_word, max_len)
answer_five()
("twelve-o'clock-at-night", 23)
What unique words have a frequency of more than 2000? What is their frequency?
"Hint: you may want to use isalpha()
to check if the token is a word and not punctuation."
This function should return a list of tuples of the form (frequency, word)
sorted in descending order of frequency.
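To see what the isalpha() hint rules out, a quick illustration on a few sample tokens:
# isalpha() is True only for purely alphabetic strings,
# so punctuation and hyphenated tokens are excluded
'whale'.isalpha()        # True
','.isalpha()            # False
'sperm-whale'.isalpha()  # False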
def answer_six():
    import operator
    from nltk.probability import FreqDist
    dist = FreqDist(moby_tokens)
    unique_words = {}
    for word in dist.keys():
        if word.isalpha() and dist[word] > 2000:
            unique_words[word] = dist[word]
    # sort ascending by frequency, reverse for descending order, and flip each pair to (frequency, word)
    unique_words = sorted(unique_words.items(), key=operator.itemgetter(1))
    unique_words.reverse()
    result = [(f, w) for w, f in unique_words]
    return result
answer_six()
[(13715, 'the'), (6513, 'of'), (6010, 'and'), (4545, 'a'), (4515, 'to'), (3908, 'in'), (2978, 'that'), (2459, 'his'), (2196, 'it'), (2113, 'I')]
What is the average number of tokens per sentence?
This function should return a float.
def answer_seven():
    sen_tokens = nltk.sent_tokenize(moby_raw)
    return len(moby_tokens) / float(len(sen_tokens))
answer_seven()
25.88591149005278
What are the 5 most frequent parts of speech in this text? What is their frequency?
This function should return a list of tuples of the form (part_of_speech, frequency) sorted in descending order of frequency.
def answer_eight():
    import collections
    pos_tokens = nltk.pos_tag(text1)
    # count the POS tag (second element) of each (token, tag) pair
    pos_counts = collections.Counter(tag for token, tag in pos_tokens)
    return pos_counts.most_common(5)
answer_eight()
[('NN', 32727), ('IN', 28662), ('DT', 25879), (',', 19204), ('JJ', 17613)]
For this part of the assignment you will create three different spelling recommenders, each of which takes a list of misspelled words and recommends a correctly spelled word for every word in the list.
For every misspelled word, the recommender should find the word in correct_spellings that has the shortest distance* and starts with the same letter as the misspelled word, and return that word as a recommendation.
*Each of the three recommenders will use a different distance measure (outlined below).
Each of the recommenders should provide recommendations for the three default words provided: ['cormulent', 'incendenece', 'validrate'].
from nltk.corpus import words
correct_spellings = words.words()
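All three recommenders below follow the same pattern: restrict correct_spellings to words starting with the same letter as the misspelled entry, score each candidate with a distance function, and keep the candidate with the smallest distance. A minimal sketch of that shared pattern (the recommend helper and its distance_fn parameter are illustrative names, not part of the assignment):
def recommend(entries, distance_fn):
    # For each misspelled entry, keep only candidates sharing its first letter,
    # then return the candidate whose distance to the entry is smallest.
    recommendations = []
    for entry in entries:
        candidates = [w for w in correct_spellings if w.startswith(entry[0])]
        recommendations.append(min(candidates, key=lambda w: distance_fn(entry, w)))
    return recommendations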
For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:
Jaccard distance on the trigrams of the two words.
This function should return a list of length three: ['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation'].
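Before the full recommender, a quick illustration of the metric itself (the word pair is arbitrary): passing a string to nltk.util.ngrams yields character n-grams, and jaccard_distance compares the two resulting sets as 1 - |intersection| / |union|:
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
trigrams_a = set(ngrams('cormulent', 3))   # character trigrams of the misspelling
trigrams_b = set(ngrams('corpulent', 3))   # character trigrams of a candidate word
jaccard_distance(trigrams_a, trigrams_b)   # 0.0 would mean identical trigram sets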
def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):
    from nltk.metrics.distance import jaccard_distance
    from nltk.util import ngrams
    spellings_series = pd.Series(correct_spellings)
    recommendations = []
    for entry in entries:
        # candidate words that start with the same letter as the misspelled entry
        spellings = spellings_series[spellings_series.str.startswith(entry[0])]
        # Jaccard distance between the character trigram sets of the entry and each candidate
        distances = ((jaccard_distance(set(ngrams(entry, 3)), set(ngrams(word, 3))), word) for word in spellings)
        closest = min(distances)
        recommendations.append(closest[1])
    return recommendations
answer_nine()
['corpulent', 'indecence', 'validate']
For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:
Jaccard distance on the 4-grams of the two words.
This function should return a list of length three: ['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation'].
def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):
    from nltk.metrics.distance import jaccard_distance
    from nltk.util import ngrams
    spellings_series = pd.Series(correct_spellings)
    recommendations = []
    for entry in entries:
        # candidate words that start with the same letter as the misspelled entry
        spellings = spellings_series[spellings_series.str.startswith(entry[0])]
        # Jaccard distance between the character 4-gram sets of the entry and each candidate
        distances = ((jaccard_distance(set(ngrams(entry, 4)), set(ngrams(word, 4))), word) for word in spellings)
        closest = min(distances)
        recommendations.append(closest[1])
    return recommendations
answer_ten()
['cormus', 'incendiary', 'valid']
For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:
Edit distance on the two words with transpositions.
This function should return a list of length three: ['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation'].
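For reference, edit_distance takes a transpositions flag; with transpositions=True a swap of two adjacent characters counts as one edit (Damerau-Levenshtein) instead of two. A small illustration with an arbitrary word pair:
from nltk.metrics.distance import edit_distance
edit_distance('recieve', 'receive')                       # 2: two substitutions
edit_distance('recieve', 'receive', transpositions=True)  # 1: one adjacent transposition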
def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):
    from nltk.metrics.distance import edit_distance
    spellings_series = pd.Series(correct_spellings)
    recommendations = []
    for entry in entries:
        # candidate words that start with the same letter as the misspelled entry
        spellings = spellings_series[spellings_series.str.startswith(entry[0])]
        # edit distance allowing transpositions, as the question specifies
        distances = ((edit_distance(entry, word, transpositions=True), word) for word in spellings)
        closest = min(distances)
        recommendations.append(closest[1])
    return recommendations
answer_eleven()
['corpulent', 'intendence', 'validate']